Abstract



This paper presents an expository development of James-Stein estimation with
substantial emphasis on exact results for non-normal location models.

The themes of the paper are a) that the improvement possible over the best 
invariant estimator via shrinkage estimation is not surprising but expected from a variety 
of perspectives; b) that the amount of shrinkage allowable to preserve domination over the
best invariant estimator, is, when properly interpreted, relatively free from the assumption 
of normality; and c) the potential savings in risk are substantial when accompanied by 
good quality prior information. 
1. Introduction

This paper presents an expository development of James-Stein estimation with 
substantial emphasis on exact results for non-normal location models.

The themes of the paper are a) that the improvement possible over the best 
invariant estimator via shrinkage estimation is not surprising but expected from a variety 
of perspectives; b) that the amount of shrinkage allowable to preserve domination over the
best invariant estimator, is, when properly interpreted, relatively free from the assumption
of normality; and c) the potential savings in risk are substantial when accompanied by 
good quality prior information. 

Relatively, much less emphasis is placed on choosing a particular shrinkage 
estimator than on demonstrating that shrinkage should produce worthwhile gains in 
problems where the error distribution is spherically symmetric. Additionally such gains are 
relatively robust with respect to assumptions concerning distribution and loss. 

The basic problem, of course, is the estimation of the mean vector $\theta$ of a p-variate
location parameter family. In the normal case (with identity covariance) for p = 1, the
usual estimator, the sample mean, is the maximum likelihood estimator, the UMVUE, the
best equivariant and minimax estimator for nearly arbitrary symmetric loss, and is
admissible for essentially arbitrary symmetric loss. Admissibility for quadratic loss was
first proved by Hodges and Lehmann (1950) and Girshick and Savage (1951) using the
Cramér-Rao inequality and by Blyth (1951) using a limit of Bayes type argument.

For p = 2, the above properties also hold in the normal case. Stein (1956) proved 
admissibility using an information inequality argument. In that same paper however, Stein 
proved a result that astonished many and which has led to an enormous and rich literature 
of substantial importance in statistical theory and practice. 

Stein (1956) showed that estimators of the form $(1 - a/(b + \|X\|^2))X$ dominate X for
a sufficiently small and b sufficiently large when $p \ge 3$. James and Stein (1961) sharpened
the result and gave an explicit class of dominating estimators, $(1 - a/\|X\|^2)X$ for
$0 < a < 2(p-2)$. They also indicated that a version of the result holds for general location
equivariant estimators with finite fourth moment and for loss functions which are concave
functions of squared error loss. Brown (1966) showed that inadmissibility of the best
equivariant estimator of location holds for virtually all problems for $p \ge 3$, and, in Brown
(1965), that admissibility tends to hold for p = 2. Minimaxity for all p follows from Kiefer
(1957).



Section 2 gives a geometrical argument due to Stein which indicates that shrinkage 
might be expected to work under quite broad distributional assumptions. 

Section 3 gives an empirical Bayes argument in the normal case which results in the 
usual James-Stein estimator. 

Section 4 presents Stein's "unbiased estimator of risk" in the normal case and 
develops the basic theory for the standard James-Stein estimator in the normal case. 

Section 5 describes a variety of extensions of the basic theory to cover shrinkage 
towards subspaces, Bayes minimax estimation, non-spherical shrinkage, and limited
translation rules. 



Section 6 considers extensions of the results of sections 4 and 5 to scale mixtures of
normal distributions.

Section 7 presents generalizations to general spherically symmetric families of
distributions, while section 8 indicates the applicability of earlier results to the multiple
observation case.

Section 9 is concerned with results for non-spherical quadratic loss and for 
non-quadratic loss.

Section 10 presents some additional comments. 

2. A Geometric Hint 

As in much of the development of the subject, the following rough geometric 
argument is basically due to Stein (1962). Consider an observation vector X in p 
dimensions with mean vector $\theta$ and independent (or uncorrelated) components. Assume
that the components have equal variance $\sigma^2$. The situation is depicted in Figure 1.

Figure 1 about here.

Since $E(X-\theta)'\theta = 0$ we expect $X-\theta$ and $\theta$ to be nearly orthogonal, especially for
large $\|\theta\|$. Since $E\|X\|^2 = p\sigma^2 + \|\theta\|^2$, it appears that X as an estimator of $\theta$ might be too
long, and that the projection of $\theta$ on X or something close to it might be a better estimator.
This projection of course depends on $\theta$ and therefore isn't a valid estimator, but perhaps we
can estimate it. If we denote this projection by $(1-a)X$ the problem is to approximate a.

One way to do this is to assume that the angle between $\theta$ and $X-\theta$ is exactly a right
angle and to assume that $\|X\|^2$ is exactly equal to its expected value $\theta'\theta + p\sigma^2$ and
similarly $\|X-\theta\|^2$ is equal to $p\sigma^2$. In this case we have

$\|Y\|^2 = \|X-\theta\|^2 - a^2\|X\|^2 = p\sigma^2 - a^2\|X\|^2$ from triangle BCD, and

$\|Y\|^2 = \|\theta\|^2 - (1-a)^2\|X\|^2 = \|X\|^2 - p\sigma^2 - (1-a)^2\|X\|^2$ from triangle ABD.

Equating these expressions we obtain
$p\sigma^2 - a^2\|X\|^2 = \|X\|^2 - p\sigma^2 - (1-a)^2\|X\|^2$, or $(1-2a)\|X\|^2 = \|X\|^2 - 2p\sigma^2$. This gives
$a = p\sigma^2/\|X\|^2$ and the suggested estimator is

$(1-a)X = \left(1 - \dfrac{p\sigma^2}{\|X\|^2}\right)X.$

The above development does not particularly depend on normality of X or even that 
$\theta$ is a location vector. Unfortunately, it fails to be a proof of the inadmissibility of X, and
also fails to distinguish between different values of p. It is however suggestive that the 
possibility of improving on the unbiased vector X by shrinkage toward the origin may be 
quite general. 
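The heuristic can be checked numerically. The following is only an illustrative sketch and is not part of the original argument: the function names are hypothetical, Gaussian errors are assumed purely for convenience (the heuristic itself does not need them), and the seed and sample size are arbitrary. It estimates $E\|X\|^2$ by simulation, for comparison with $p\sigma^2 + \|\theta\|^2$, and evaluates the suggested estimator $(1 - p\sigma^2/\|X\|^2)X$.

```python
import random


def geometric_shrinkage(x, sigma2=1.0):
    """Return (1 - p*sigma2/||x||^2) * x, the estimator suggested by the heuristic."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    a = p * sigma2 / norm2  # the heuristic value a = p*sigma^2 / ||x||^2
    return [(1.0 - a) * v for v in x]


def mean_squared_length(theta, sigma2=1.0, n_draws=20000, seed=0):
    """Monte Carlo estimate of E||X||^2 for X = theta + noise.

    Assumption for illustration only: independent Gaussian noise with
    variance sigma2 in each coordinate.
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(n_draws):
        total += sum((t + rng.gauss(0.0, sd)) ** 2 for t in theta)
    return total / n_draws
```

For example, with `theta = [1.0] * 10` and unit variance the simulated value should be close to $p\sigma^2 + \|\theta\|^2 = 20$, which is noticeably longer than $\|\theta\|^2 = 10$.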

3. An Empirical Bayes Argument

The following well known Empirical Bayes argument also leads to the James-Stein 
estimator. The origins of this argument, which we only briefly sketch, are unknown (to us).
It has appeared numerous times in print (e.g. Lehmann (1983) p. 299). 

Let X have a p-variate normal distribution with mean vector $\theta$ and (for simplicity)
covariance matrix equal to $\sigma^2$ (known) times the identity. Suppose the prior distribution
of $\theta$ is normal with mean vector 0 and covariance matrix equal to b times the identity,
where b is an unknown scalar. The posterior mean (assuming for the moment that b is
known) is

$\left(\dfrac{b}{\sigma^2+b}\right)X = \left(1 - \dfrac{\sigma^2}{\sigma^2+b}\right)X.$

One way to estimate the unknown scalar b is the following. Since $X-\theta$ conditional
on $\theta$ is normal with mean vector 0 and covariance $\sigma^2 I$, $X-\theta$ and $\theta$ are independent. Hence
$X = (X-\theta) + \theta$ is marginally distributed as a p-variate normal with mean vector 0 and
covariance $(\sigma^2 + b)I$. Therefore, $\|X\|^2/(b + \sigma^2)$ has a central chi-square distribution with
p degrees of freedom. It follows that marginally, $E[(p-2)/\|X\|^2] = \dfrac{1}{\sigma^2+b}$ and hence

$\left(1 - \dfrac{(p-2)\sigma^2}{\|X\|^2}\right)X$

may reasonably be considered an Empirical Bayes estimator for the above
normal prior with unknown scale. This estimator of course is exactly the usual
James-Stein estimator.
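The key step of the argument, that marginally $E[(p-2)/\|X\|^2] = 1/(\sigma^2+b)$, can be checked by simulation. The sketch below is an illustration only: the function names, seed, and sample size are hypothetical choices, and the two-stage sampling simply mirrors the normal prior and normal likelihood described above.

```python
import random


def marginal_moment(p, sigma2, b, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[(p - 2) / ||X||^2] under the two-stage model.

    theta_i ~ N(0, b), then X_i | theta_i ~ N(theta_i, sigma2), independently.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        norm2 = 0.0
        for _ in range(p):
            theta_i = rng.gauss(0.0, b ** 0.5)
            x_i = theta_i + rng.gauss(0.0, sigma2 ** 0.5)
            norm2 += x_i * x_i
        total += (p - 2) / norm2
    return total / n_draws


def empirical_bayes_estimate(x, sigma2=1.0):
    """Plug (p - 2)/||x||^2 in for 1/(sigma2 + b) in the posterior mean."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    factor = 1.0 - (p - 2) * sigma2 / norm2
    return [factor * v for v in x]
```

With `p = 8`, `sigma2 = 1.0` and `b = 3.0`, the simulated moment should be close to $1/(\sigma^2 + b) = 0.25$.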

4. Some Spherical Normal Theory 

Let X have a p-variate normal distribution with mean vector $\theta$ and covariance
matrix equal to the identity. The problem is to estimate $\theta$ with loss equal to

(4.1)    $L(\theta,\delta) = \|\delta - \theta\|^2 = \sum_{i=1}^{p} (\delta_i - \theta_i)^2 .$

Stein (1956) showed for $p \ge 3$ that the usual estimator $\delta_0(X) = X$ is dominated by
$\delta_{a,b}(X) = (1 - a(b + \|X\|^2)^{-1})X$ provided a is sufficiently small and b is sufficiently large.
James and Stein (1961) showed that

(4.2)    $\delta_a(X) = (1 - a\|X\|^{-2})X$

dominates X for $0 < a < 2(p-2)$ and that $a = p-2$ gives the uniformly best estimator in
the class.

Their proof used the Poisson representation of the non-central chi-square
distribution, but since the mid 1970's the "unbiased estimation of risk" technique of
Stein (1981) has been used and simplifies proofs substantially.

The technique, which we describe below, depends on the following Lemma. 

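LEMMA 4.1: Let $Y \sim N(\theta,1)$; then $E[h(Y)(Y-\theta)] = \mathrm{Cov}(Y,h(Y)) = Eh'(Y)$ (provided e.g.
that h(Y) is the indefinite integral of h'(Y), $\lim_{Y\to\pm\infty} h(Y)\exp[-\tfrac12(Y-\theta)^2] = 0$, and all
integrals are finite).

Proof: Integration by parts gives

$\dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} h(y)(y-\theta)\exp[-\tfrac12(y-\theta)^2]\,dy
= \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} h(y)\,\dfrac{d}{dy}\left(-\exp[-\tfrac12(y-\theta)^2]\right)dy$

$= -\dfrac{1}{\sqrt{2\pi}}\,h(y)\exp[-\tfrac12(y-\theta)^2]\,\Big|_{-\infty}^{\infty}
+ \dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} h'(y)\exp[-\tfrac12(y-\theta)^2]\,dy$

$= Eh'(Y).$

□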
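Lemma 4.1 is easy to verify numerically for a particular h. The sketch below is illustrative only (the helper names, integration range, and step count are arbitrary choices, not taken from the paper); it approximates both sides of the identity with the trapezoid rule.

```python
import math


def normal_expectation(func, theta, half_width=12.0, n_steps=20000):
    """Trapezoid-rule approximation of E[func(Y)] for Y ~ N(theta, 1)."""
    lo = theta - half_width
    step = 2.0 * half_width / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        y = lo + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0  # end points count half
        density = math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2.0 * math.pi)
        total += weight * func(y) * density
    return total * step


def stein_identity_sides(h, h_prime, theta):
    """Return (E[h(Y)(Y - theta)], E[h'(Y)]) for Y ~ N(theta, 1)."""
    lhs = normal_expectation(lambda y: h(y) * (y - theta), theta)
    rhs = normal_expectation(h_prime, theta)
    return lhs, rhs
```

For instance, with $h(y) = y^3$ both sides equal $3(1 + \theta^2)$, and with $h = \tanh$ both sides equal $E[1 - \tanh^2(Y)]$.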
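To obtain an unbiased estimator of the risk of $\delta_a(X)$, write

(4.3)    $R(\theta,\delta_a(X)) = E\left\|\left(1 - \dfrac{a}{\|X\|^2}\right)X - \theta\right\|^2$

$= E\|X-\theta\|^2 + a^2E\dfrac{1}{\|X\|^2} - 2aE\dfrac{(X-\theta)'X}{\|X\|^2}$

$= p + a^2E\dfrac{1}{\|X\|^2} - 2a\sum_{i=1}^{p}E\left[(X_i-\theta_i)\dfrac{X_i}{\sum_{j=1}^{p}X_j^2}\right]$

$= p + a^2E\dfrac{1}{\|X\|^2} - 2a\sum_{i=1}^{p}E\left[\dfrac{\partial}{\partial X_i}\left(\dfrac{X_i}{\|X\|^2}\right)\right]$    (by the lemma)

$= p + a^2E\dfrac{1}{\|X\|^2} - 2a\sum_{i=1}^{p}E\left[\dfrac{\|X\|^2 - 2X_i^2}{(\|X\|^2)^2}\right]$

$= p + a^2E\dfrac{1}{\|X\|^2} - 2a(p-2)E\dfrac{1}{\|X\|^2}$

$= p + [a^2 - 2a(p-2)]\,E\dfrac{1}{\|X\|^2}.$

Note that the quadratic $a^2 - 2a(p-2)$ is negative in the range $(0, 2(p-2))$ and attains
its minimum at $a = p-2$. Hence, we have the following result.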

THEOREM 4.1 a. The estimator $\delta_a(X)$ in (4.2) dominates X for $0 < a < 2(p-2)$
(for $p \ge 3$). The estimator $\delta_{p-2}(X) = \left(1 - \dfrac{p-2}{X'X}\right)X$ has the uniformly smallest risk of any
estimator in the class.

b. The risk of $\delta_{p-2}(X)$ at $\theta = 0$ is equal to 2 for every dimension $p \ge 3$.

The proof of part b follows by noting that for $\theta = 0$, $\|X\|^2$ has a chi-square
distribution with p degrees of freedom and hence by (4.3)

$R(0,\delta_{p-2}(X)) = p - (p-2)^2\,E\dfrac{1}{\chi^2_p} = p - (p-2) = 2.$
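Theorem 4.1 can be illustrated by a small simulation. The sketch below is only a hedged illustration: the function names, seed, and number of draws are hypothetical choices, and the Monte Carlo figures are approximations of the exact risks derived above.

```python
import random


def james_stein(x, a=None):
    """The estimator (1 - a/||x||^2) x of (4.2); a defaults to p - 2."""
    p = len(x)
    if a is None:
        a = p - 2
    norm2 = sum(v * v for v in x)
    return [(1.0 - a / norm2) * v for v in x]


def monte_carlo_risk(estimator, theta, n_draws=20000, seed=1):
    """Average squared-error loss (4.1) over draws X ~ N(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d = estimator(x)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_draws
```

With `theta = [0.0] * 10`, `monte_carlo_risk(james_stein, theta)` should be close to 2 while `monte_carlo_risk(lambda x: x, theta)` should be close to p = 10. Because both calls use the same seed, the two estimators are compared on identical draws.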




Part b suggests, particularly for large values of p, that very large savings in risk are
possible over the classical estimator in the region near $\theta = 0$ at no cost of increased risk
elsewhere.

Note also, that while substantial savings in risk are possible, the James-Stein
estimator is itself inadmissible due to its strange behavior for small X'X. The shrinkage
factor $\left(1 - \dfrac{p-2}{X'X}\right)$ becomes negative for $X'X < p-2$. A better estimator is given by the
"positive-part" estimator $\left(1 - \dfrac{p-2}{X'X}\right)^{+}X$. Interestingly, this estimator is itself inadmissible
because it fails to be generalized Bayes, but to the authors' knowledge, no improved
estimator has been found.

5. More Normal Theory

In the previous section we showed that the James-Stein estimator dominates the
classical estimator in the (identity covariance) spherical normal case and that its risk at
the origin is equal to 2 regardless of the dimension of the problem. One may interpret this
result as saying that if your prior guess that the origin is the true value is correct you may
save substantially, but if you are wrong you lose nothing. This suggests, and it is
obviously true, that a similar result holds for any arbitrary origin $\theta_0$. That is, the
estimator

(5.1)    $\delta(X) = \theta_0 + \left(1 - \dfrac{p-2}{\|X-\theta_0\|^2}\right)(X - \theta_0)$

will dominate the usual estimator and have risk equal to 2 at $\theta_0$ provided $p \ge 3$.
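Shrinkage toward an arbitrary origin and the positive-part modification combine naturally. The following is a minimal sketch under the identity-covariance normal setup of this section; the function name and the handling of the degenerate case (observation exactly equal to the origin) are illustrative assumptions, not part of the original text.

```python
def positive_part_james_stein(x, origin=None):
    """Shrink x toward `origin` with factor max(0, 1 - (p - 2)/||x - origin||^2)."""
    p = len(x)
    if origin is None:
        origin = [0.0] * p
    diff = [xi - oi for xi, oi in zip(x, origin)]
    norm2 = sum(d * d for d in diff)
    if norm2 == 0.0:
        # Assumption: when x coincides with the origin, return the origin itself.
        return list(origin)
    factor = max(0.0, 1.0 - (p - 2) / norm2)
    return [oi + factor * d for oi, d in zip(origin, diff)]
```

When $\|X - \theta_0\|^2 < p-2$ the factor is truncated at zero, so the estimate is the chosen origin itself rather than a point on the far side of it.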

Suppose it is believed that $\theta$ lies in V, an s dimensional subspace of $R^p$. Then
letting the projection of X onto V be V and W = X-V, the projection of X onto the
ortho-complement of V, we have the following result.

THEOREM 5.1 The estimator

(5.2)    $V + \left(1 - \dfrac{p-s-2}{W'W}\right)W$

dominates X, and has risk equal to s + 2 for all $\theta$ in V, provided $p - s \ge 3$. This is perhaps
most easily seen by considering a canonical version, when V represents the first s coordinates
and W the remaining p-s coordinates. The estimator (5.2) then uses the classical estimate on V
and the James-Stein estimate on the (p-s) dimensional orthogonal complement of V. In general the
result follows by noting that V and $\left(1 - \dfrac{p-s-2}{W'W}\right)W$ are independent (and orthogonal)
and that the estimation problem breaks up into two orthogonal components ($\theta = \nu + \omega$,
with $\nu$ in V and $\omega$ in the orthogonal complement of V).

One particularly important application of this idea is the estimator proposed by 
Lindley (in discussion of Stein (1962)) where $V = \{\theta : \theta_1 = \theta_2 = \dots = \theta_p\}$, dim V = 1, and the
estimator (5.2) becomes

(5.3)    $\bar{X}\mathbf{1} + \left(1 - \dfrac{p-3}{\sum_i (X_i - \bar{X})^2}\right)(X - \bar{X}\mathbf{1})$

with $\mathbf{1} = (1,\dots,1)'$.
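The estimator (5.3) is simple to compute. The sketch below is an illustration only; the function name is hypothetical, unit variance is assumed as in this section, and the data are assumed not to be all equal (so that the sum of squares is positive).

```python
def lindley_estimator(x):
    """Shrink each coordinate toward the grand mean, as in (5.3).

    Assumes unit variance, p >= 4, and that the x_i are not all equal.
    """
    p = len(x)
    xbar = sum(x) / p
    ss = sum((v - xbar) ** 2 for v in x)  # sum of squares about the mean
    factor = 1.0 - (p - 3) / ss
    return [xbar + factor * (v - xbar) for v in x]
```

For `x = [1, 2, 3, 4, 5]` the grand mean is 3, the sum of squares is 10, the factor is 1 - 2/10 = 0.8, and every coordinate moves one fifth of the way toward 3.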



As the preceding discussion and section 3 indicate there is a strong Bayesian
connection to be made. In particular we indicated in section 3 that the James-Stein
estimator could be viewed as an Empirical Bayes estimator for a normal prior with mean 0
and covariance matrix $\sigma^2 I$, with $\sigma^2$ unknown and estimated from the data. Strawderman
(1971) established a more formal Bayesian (as opposed to Empirical Bayesian) connection
along these lines. We now describe this development.

First, an extension of the basic James-Stein result due to Baranchik (1964, 1970) is
helpful.



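LEMMA 5.1: The estimator $\left(1 - \dfrac{r(X'X)}{X'X}\right)X$ is minimax for the loss (4.1) provided
$0 \le r(\cdot) \le 2(p-2)$, and $r(\cdot)$ is monotone increasing.

Proof: The proof in the case that $r(X'X)/X'X$ satisfies the conditions of Lemma 4.1
essentially follows that of Theorem 4.1. By Lemma 4.1

$E\left[\dfrac{(X-\theta)'X}{X'X}\,r(X'X)\right] = (p-2)\,E\dfrac{r(X'X)}{X'X} + 2E\,r'(X'X).$

Hence

$R(\theta,\delta) = p + E\dfrac{r^2(X'X)}{X'X} - 2(p-2)E\dfrac{r(X'X)}{X'X} - 4E\,r'(X'X)$

$\le p + [2(p-2) - 2(p-2)]\,E\dfrac{r(X'X)}{X'X} = p.$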

This lemma allows smoother shrinkage factors than the positive part James-Stein
estimators and opens the possibility that generalized Bayes and perhaps proper Bayes
estimators may be found in the class (other than X itself). To this end, consider a two
stage prior for $\theta$ such that at the first stage $\theta \mid \lambda \sim N(0, \lambda^{-1}(1-\lambda)I)$, and at the second stage
$\lambda \sim (1-a)\lambda^{-a}$ on $0 < \lambda < 1$ (for a < 1). Then the Bayes estimator is given by

(5.5)    $E(\theta|X) = E[E(\theta|X,\lambda)\,|\,X] = E[(1-\lambda)X\,|\,X] = (1 - E(\lambda|X))X.$

A straightforward calculation gives

(5.6)    $E(\lambda|X) = \dfrac{\int_0^1 \lambda^{\frac p2 - a + 1}\exp(-\tfrac{\lambda}{2}X'X)\,d\lambda}{\int_0^1 \lambda^{\frac p2 - a}\exp(-\tfrac{\lambda}{2}X'X)\,d\lambda}
= \dfrac{1}{X'X}\left[p + 2 - 2a - \dfrac{2\exp(-\tfrac12 X'X)}{\int_0^1 \lambda^{\frac p2 - a}\exp(-\tfrac{\lambda}{2}X'X)\,d\lambda}\right]
= \dfrac{r(X'X)}{X'X}$

where r(X'X) is defined to be the term in brackets on the right side of (5.6). Since
$r(X'X) \le p+2-2a$, and since $\int_0^1 \lambda^{\frac p2 - a}\exp[\tfrac12 X'X(1-\lambda)]\,d\lambda$ is increasing in X'X
(so that r is increasing), the conditions of Lemma 5.1 will be satisfied provided
$p+2-2a \le 2(p-2)$ or equivalently

(5.7)    $a \ge \tfrac12(6-p).$

Since in order for $\lambda^{-a}$ (the second stage prior) to be integrable we must have a < 1, it is
seen that (5.7) is satisfied for such an a provided that $p \ge 5$. Hence we have

THEOREM 5.2. a) For $p \ge 5$ the proper Bayes estimator (5.5) is minimax provided
$a \ge \tfrac12(6-p)$.

b) For $p \ge 3$ the estimator (5.5) is generalized Bayes and minimax provided
$\tfrac12(6-p) \le a < \tfrac12(p+2)$.

The proof of b) follows by noting that (5.6) makes sense (i.e. the generalized Bayes
estimator exists) provided $\tfrac12 p - a > -1$, which is equivalent to the right inequality. Note
that the double inequality holds only if p > 2. Strawderman (1972) showed that no proper
Bayes minimax estimators exist for p < 5.
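The conditions of Lemma 5.1 for the function r in (5.6) can be inspected numerically. The sketch below is an illustration under stated assumptions: it takes r(s) to be the bracketed term of (5.6), rewritten as $p + 2 - 2a - 2/\int_0^1 \lambda^{p/2-a}\exp[\tfrac12 s(1-\lambda)]\,d\lambda$, assumes $p/2 - a \ge 0$ so that the integrand is bounded, and uses a plain midpoint rule; the function names and step count are hypothetical.

```python
import math


def integral_01(func, n_steps=20000):
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    step = 1.0 / n_steps
    return step * sum(func((k + 0.5) * step) for k in range(n_steps))


def strawderman_r(s, p, a):
    """r(s) = p + 2 - 2a - 2 / int_0^1 lam^(p/2 - a) exp(s (1 - lam) / 2) d lam.

    Assumes p/2 - a >= 0 so the integrand has no singularity at lam = 0.
    """
    k = p / 2.0 - a
    denom = integral_01(lambda lam: lam ** k * math.exp(0.5 * s * (1.0 - lam)))
    return p + 2.0 - 2.0 * a - 2.0 / denom
```

For example, with p = 6 and a = 0.5 the function starts at r(0) = 0, increases with s, and stays below p + 2 - 2a = 7, which does not exceed 2(p - 2) = 8, so the requirements of Lemma 5.1 are met in this case.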

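We briefly take a broader view and describe a result of Stein (1981) concerning
minimaxity of (generalized) Bayes estimators. If $\pi(\theta)$ is the (generalized) prior density,
then the (generalized) Bayes estimator is given by

(5.6)    $\delta_\pi(X) = \dfrac{\int \theta\,e^{-\frac12\|X-\theta\|^2}\pi(\theta)\,d\theta}{\int e^{-\frac12\|X-\theta\|^2}\pi(\theta)\,d\theta}
= X + \nabla\log f(X) = X + \dfrac{\nabla f(X)}{f(X)}$

where $\nabla = \left(\dfrac{\partial}{\partial X_1}, \dfrac{\partial}{\partial X_2}, \dots, \dfrac{\partial}{\partial X_p}\right)'$ and
$f(X) = \int e^{-\frac12\|X-\theta\|^2}\pi(\theta)\,d\theta$ is (essentially) the marginal density of X.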

An easy application of Lemma 4.1 gives the following very general unbiased estimate 
of risk for a nearly arbitrary estimator of the form X+g(x). 

LEMMA 5.2 Let $\delta(X) = X + g(X)$ be such that $g(\cdot)$ is almost differentiable and such that
$\sum_{i=1}^{p} E|\nabla_i g_i(X)| < \infty$. Then

$E\|X + g(X) - \theta\|^2 = p + E_\theta\left[\|g(X)\|^2 + 2\nabla\cdot g(X)\right].$

Hence if $\|g(X)\|^2 + 2\nabla\cdot g(X) \le 0$ for all X then X + g(X) dominates X.

(Note that Theorem 4.1 and Lemma 5.1 are special cases). 
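The unbiased risk estimate of Lemma 5.2, $p + \|g(X)\|^2 + 2\nabla\cdot g(X)$, can be evaluated for any smooth g. The sketch below is illustrative: the function names are hypothetical, and the divergence is approximated by central differences rather than computed analytically. For the James-Stein choice $g(x) = -ax/\|x\|^2$ it reproduces $p + [a^2 - 2a(p-2)]/\|x\|^2$, the integrand of (4.3).

```python
def james_stein_g(x, a):
    """g(x) = -a x / ||x||^2, so that x + g(x) is the estimator (4.2)."""
    norm2 = sum(v * v for v in x)
    return [-a * v / norm2 for v in x]


def divergence(g, x, h=1e-5):
    """Central-difference approximation of sum_i d g_i / d x_i at x."""
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        total += (g(up)[i] - g(dn)[i]) / (2.0 * h)
    return total


def unbiased_risk_estimate(g, x):
    """p + ||g(x)||^2 + 2 div g(x), the quantity whose expectation is the risk."""
    gx = g(x)
    return len(x) + sum(v * v for v in gx) + 2.0 * divergence(g, x)
```

For `x = [1, 2, 2, 0, 0]` and `a = 3` we have p = 5 and $\|x\|^2 = 9$, so the estimate is 5 + (9 - 18)/9 = 4.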

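Application of the lemma to a (generalized) Bayes estimator of the form (5.6) gives

(5.7)    $R(\theta,\delta_\pi) = E\left\|X + \dfrac{\nabla f(X)}{f(X)} - \theta\right\|^2$

$= p + E\left[\dfrac{\|\nabla f(X)\|^2}{f^2(X)} + 2\,\dfrac{f(X)\nabla^2 f(X) - \|\nabla f(X)\|^2}{f^2(X)}\right]$

$\le p + 2E\left[\dfrac{\nabla^2 f(X)}{f(X)}\right].$

(Actually the penultimate expression can be simplified to equal $p + 4E\left[\dfrac{\nabla^2\sqrt{f(X)}}{\sqrt{f(X)}}\right]$, where
$\nabla^2 f(X) = \sum_{i=1}^{p}\dfrac{\partial^2}{\partial X_i^2}f(X)$ is the Laplacian of $f(\cdot)$.)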

A function f(X) such that $\nabla^2 f(X) \le 0$ for all X is called superharmonic. It has the
property that the average of the function over a sphere of radius r about a point $X_0$ is
never greater than $f(X_0)$ for all $X_0$ and r > 0. Further it is easily seen that convex
combinations of superharmonic functions are superharmonic. It follows then that if $\pi(\theta)$ is
a superharmonic prior, then $f(X) = \int e^{-\frac12\|X-\theta\|^2}\pi(\theta)\,d\theta$ is also superharmonic. We then
have the neat result.

THEOREM 5.2. If $\pi(\theta)$ is superharmonic, then the estimator (5.6) is minimax.

Incidentally, note that if $\pi_\gamma(\theta)$ is superharmonic for each $\gamma$, then so is
$\pi(\theta) = \int \pi_\gamma(\theta)\,dF(\gamma)$ for any distribution $F(\cdot)$. This opens up a nice class of multiple
shrinkage minimax estimators due to George (1985).

To conclude this section we present three examples which illustrate the utility of
Lemma 5.2.

EXAMPLE 5.1 (Stein 1981)

Let

(5.8)    $\delta(X) = X - \dfrac{\beta AX}{X'BX}.$

Then

$E\|\delta(X) - \theta\|^2 = p + E\left[\beta^2\,\dfrac{X'A^2X}{(X'BX)^2} - 2\nabla\cdot\left(\dfrac{\beta AX}{X'BX}\right)\right]$

$= p + E\left[\beta^2\,\dfrac{X'A^2X}{(X'BX)^2} - \dfrac{2\beta\,\mathrm{tr}A}{X'BX} + \dfrac{4\beta X'ABX}{(X'BX)^2}\right]$

(provided the expectations exist).

If A is a fixed symmetric matrix, $B = [(\mathrm{tr}A)I - 2A]^{-1}A^2$, and $2A < (\mathrm{tr}A)I$ (the
largest eigenvalue of A is less than $\tfrac12$ the trace of A), then

$E\|\delta(X) - \theta\|^2 = p + (\beta^2 - 2\beta)\,E\left[\dfrac{X'A^2X}{(X'BX)^2}\right].$

Hence $\delta$ dominates X provided $0 < \beta < 2$. Furthermore $\beta = 1$ is the uniformly best choice.
Stein (1981) gives an interesting application of this result to three term symmetric
moving averages of the form

(5.9)    $\delta_i = X_i - \lambda(X)\left(X_i - \tfrac12(X_{i-1} + X_{i+1})\right)$

where $X_0 = X_p$. Here

$A_{ij} = -\tfrac12$ if $j - i \equiv \pm 1 \pmod p$, $A_{ij} = 1$ if $j - i \equiv 0 \pmod p$, and $A_{ij} = 0$ otherwise.

The characteristic roots are $1 - \cos(2\pi j/p) \le 2$ with $-[\tfrac p2] < j \le [\tfrac p2]$, giving tr(A) = p. Hence
for $p \ge 5$ (5.9) dominates X for $\lambda(X) = \beta/(X'BX)$, with B and $\beta$ as in (5.8).
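The eigenvalue condition of Example 5.1 for the moving-average matrix can be checked directly. The sketch below is an illustration only: the function names and the numerical tolerance are hypothetical choices, the index j runs over 0, ..., p-1 (which gives the same set of roots by periodicity), and p is assumed to be at least 3 so that the two neighbours of each coordinate are distinct.

```python
import math


def moving_average_matrix(p):
    """Circulant matrix with 1 on the diagonal and -1/2 for cyclic neighbours (p >= 3)."""
    a = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            d = (j - i) % p
            if d == 0:
                a[i][j] = 1.0
            elif d in (1, p - 1):
                a[i][j] = -0.5
    return a


def characteristic_roots(p):
    """Eigenvalues 1 - cos(2 pi j / p), j = 0, ..., p - 1."""
    return [1.0 - math.cos(2.0 * math.pi * j / p) for j in range(p)]


def eigenvalue_condition_holds(p, tol=1e-9):
    """True when twice the largest root is strictly below the trace (up to tol)."""
    a = moving_average_matrix(p)
    trace = sum(a[i][i] for i in range(p))
    return 2.0 * max(characteristic_roots(p)) < trace - tol
```

The largest root never exceeds 2 and the trace is p, so the condition "largest eigenvalue less than half the trace" fails for p = 3 and p = 4 and holds from p = 5 onward, in agreement with the requirement stated above.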

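EXAMPLE 5.2 (Berger 1980).

Let the generalized prior density be given by

$\pi(\theta) = \int_0^1 [\det B(\lambda)]^{-\frac12}\exp\left[-\tfrac12(\theta-\mu)'B(\lambda)^{-1}(\theta-\mu)\right]\lambda^{\,n-1-\frac p2}\,d\lambda$

where $B(\lambda) = \lambda^{-1}C - I$ for $0 < \lambda < 1$, C-I positive definite and n > 0. Here the distribution
of $\theta$ given $\lambda$ is $N(\mu, B(\lambda))$ and reduces to the two stage prior of Strawderman discussed
earlier in this section if C = I. The (generalized) Bayes estimator is given by

(5.10)    $\delta_n(X) = \mu + \left(I - \dfrac{r_n\left((X-\mu)'C^{-1}(X-\mu)\right)C^{-1}}{(X-\mu)'C^{-1}(X-\mu)}\right)(X - \mu)$

where

$r_n(v) = \dfrac{v\int_0^1 \lambda^{n}\exp(-\tfrac{\lambda v}{2})\,d\lambda}{\int_0^1 \lambda^{n-1}\exp(-\tfrac{\lambda v}{2})\,d\lambda}
= 2n\left(1 - \left[n\int_0^1 \lambda^{n-1}\exp\left(\tfrac{v}{2}(1-\lambda)\right)d\lambda\right]^{-1}\right).$

It follows from Lemma 5.2 (as in Example 5.1) that if $(2+n)\,\lambda_{\max}(C^{-1}) \le \mathrm{tr}\,C^{-1}$, where
$\lambda_{\max}(C^{-1})$ denotes the largest eigenvalue of $C^{-1}$, then $\delta_n(X)$ is minimax.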

EXAMPLE 5.3 (Stein 1981). 

Efron and Morris (1971, 1972) considered estimators which modified the James-Stein 
estimator by requiring that no coordinate moves by more than a preassigned quantity C. 
Stein gave an alternative "limited translation" rule based on order statistics as follows. Let
$Z_j = |X_j|$ and let $Z_{(1)} < Z_{(2)} < \dots < Z_{(p)}$ be the order statistics. Fix K a positive integer (a
large fraction of p) and consider $\delta(X) = X + g(X)$ where

$g_j(X) = -\dfrac{a}{\sum_i (X_i^2 \wedge Z_{(K)}^2)}\,X_j$ if $|X_j| \le Z_{(K)}$,

$g_j(X) = -\dfrac{a}{\sum_i (X_i^2 \wedge Z_{(K)}^2)}\,Z_{(K)}\,\mathrm{sgn}\,X_j$ if $|X_j| > Z_{(K)}$,

where $a \wedge b = \min(a,b)$.

Application of Lemma 5.2 gives

$E\|\delta(X) - \theta\|^2 = p + [a^2 - 2(K-2)a]\,E\left[\dfrac{1}{\sum_j (X_j^2 \wedge Z_{(K)}^2)}\right].$

Hence the estimator is minimax if $0 \le a \le 2(K-2)$ and the uniformly best choice of a is
K-2.
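The limited translation rule of Example 5.3 can be sketched as follows. This is an illustration under stated assumptions: the rule is taken to be $\delta_j = X_j - a\,\mathrm{sgn}(X_j)\min(|X_j|, Z_{(K)})/\sum_i \min(X_i^2, Z_{(K)}^2)$ with default a = K - 2, the function name is hypothetical, and K is assumed to satisfy $3 \le K \le p$.

```python
def limited_translation(x, K, a=None):
    """Order-statistic limited translation rule; a defaults to K - 2.

    Coordinates larger in absolute value than the K-th order statistic of
    |x_1|, ..., |x_p| are moved only as far as that order statistic would be.
    """
    if a is None:
        a = K - 2
    z_sorted = sorted(abs(v) for v in x)
    z_k = z_sorted[K - 1]                      # K-th smallest absolute value
    denom = sum(min(v * v, z_k * z_k) for v in x)
    out = []
    for v in x:
        clipped = max(-z_k, min(z_k, v))       # sgn(v) * min(|v|, z_k)
        out.append(v - a * clipped / denom)
    return out
```

With K = p nothing is clipped and the rule reduces to the James-Stein estimator with a = p - 2. With K < p, an outlying coordinate such as 10 in `[1, -2, 3, 10]` is translated by no more than the coordinate at the K-th order statistic.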

6. Scale Mixtures of Normals

Stein (1956) showed that the usual estimator of a location vector could be improved
upon quite generally for $p \ge 3$ and Brown (1966) substantially extended this conclusion to
essentially arbitrary loss functions. Explicit results of the James-Stein type however were
restricted to the case of the normal distribution. Strawderman (1974) considered scale
mixtures of multivariate normal distributions as follows. Let X have density $f(\|X-\theta\|^2)$
where

(6.1)    $f(\|X-\theta\|^2) = (\sqrt{2\pi})^{-p}\int_0^\infty \exp\left[-\dfrac{1}{2\sigma^2}\|X-\theta\|^2\right]\sigma^{-p}\,dG(\sigma)$

where $G(\cdot)$ is a known distribution function. The object is to estimate $\theta$ with loss (4.1).

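Such a random variable X clearly has the interpretation that given $\sigma$, X is normal
with mean vector $\theta$ and covariance matrix $\sigma^2 I$. The unconditional distribution of $\sigma$ is
$G(\cdot)$. This interpretation together with Lemma 5.2 allows the following calculation of the
risk of a smooth estimator X + g(X).

(6.2)    $E\|X + g(X) - \theta\|^2 = E^{\sigma}E^{X|\sigma}\|X + g(X) - \theta\|^2$

$= pE^{\sigma}\sigma^2 + E^{\sigma}E^{X|\sigma}\left[\|g(X)\|^2 + 2\sigma^2\,\nabla\cdot g(X)\right]$

where the divergence $\nabla\cdot g(X)$ is taken with respect to X.

For estimators of the James-Stein type, $g(X) = -\dfrac{aX}{X'X}$ and $\nabla\cdot g(X) = -\dfrac{a(p-2)}{X'X}$,
and hence

$E\left\|\left(1 - \dfrac{a}{X'X}\right)X - \theta\right\|^2 = pE^{\sigma}\sigma^2 + E^{\sigma}E^{X|\sigma}\left[\dfrac{a^2}{X'X} - \dfrac{2a(p-2)\sigma^2}{X'X}\right]$

$= pE^{\sigma}\sigma^2 + E^{\sigma}\left[\left(\dfrac{a^2}{\sigma^2} - 2a(p-2)\right)E^{X|\sigma}\left(\dfrac{\sigma^2}{X'X}\right)\right].$

Note that $X'X/\sigma^2$ given $\sigma^2$ is distributed as a non-central $\chi^2$ with p degrees of
freedom and a non-centrality parameter proportional to $\theta'\theta/\sigma^2$. Hence $E^{X|\sigma}\left(\dfrac{\sigma^2}{X'X}\right)$ is an increasing function
of $\sigma^2$. Since $\dfrac{a^2}{\sigma^2} - 2a(p-2)$ is decreasing in $\sigma^2$ we have

(6.3)    $E\left\|\left(1 - \dfrac{a}{X'X}\right)X - \theta\right\|^2 \le pE^{\sigma}\sigma^2 + E^{\sigma}\left[\dfrac{a^2}{\sigma^2} - 2a(p-2)\right]E^{\sigma}E^{X|\sigma}\left[\dfrac{\sigma^2}{X'X}\right] \le pE^{\sigma}\sigma^2$

provided $0 < a \le 2(p-2)/E^{\sigma}\left(\dfrac{1}{\sigma^2}\right) = 2/E_0\left(\dfrac{1}{X'X}\right)$. Therefore we have the following result.

THEOREM 6.1. Let X have the distribution (6.1) for $p \ge 3$. Then the estimator $\left(1 - \dfrac{a}{X'X}\right)X$
dominates X (for the loss 4.1) provided $0 < a \le 2/E_0\left(\dfrac{1}{X'X}\right)$.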

It is interesting to note that this result reduces to Theorem 4.1 a if the distribution
of $\sigma$ is degenerate at $\sigma = 1$. Furthermore, the shrinkage factor $a = 2/E_0\left(\dfrac{1}{X'X}\right)$ is an upper
bound for any distribution such that each coordinate has mean 0, as an easy calculation
shows. What is remarkable about Theorem 6.1 is that if the shrinkage factor is interpreted
properly, the James-Stein result extends directly to the entire class of scale mixtures of
normal distributions.

Note that this class includes (if $1/\sigma^2$ has a suitably scaled chi-square distribution) the family of
multivariate t distributions, with tails of polynomial order, as well as the family of normal
distributions.

The geometrical argument of section 2 which hinted at shrinkage factors of the order
of $p\sigma^2$ regardless of normality is thus validated for a wide class of distributions.
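The upper limit on the shrinkage constant in Theorem 6.1, $2(p-2)/E(1/\sigma^2)$, is easy to evaluate for a discrete mixing distribution. The sketch below is illustrative only: the function name and the representation of G as a list of (sigma, probability) pairs are assumptions made for the example.

```python
def shrinkage_upper_bound(p, mixing):
    """Upper limit 2 (p - 2) / E(1 / sigma^2) for a discrete mixing distribution.

    `mixing` is a list of (sigma, probability) pairs whose probabilities sum to 1.
    """
    e_inv_sigma2 = sum(prob / sigma ** 2 for sigma, prob in mixing)
    return 2.0 * (p - 2) / e_inv_sigma2
```

With G degenerate at $\sigma = 1$ the limit is the normal-theory value 2(p - 2). For an equal mixture of $\sigma = 1$ and $\sigma = 2$ in dimension p = 5, $E(1/\sigma^2) = 0.625$ and the limit rises to 9.6, reflecting the larger average error variance.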

Chou and Strawderman (1986) extended this result to include estimators of the type 
studied in Lemma 5.2. Here is a simple form of their result. 

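THEOREM 6.2. Let X be as in Theorem 6.1, and let g(X) be such that

a) $\|g(X)\|^2 + \nabla\cdot g(X) \le 0$

b) $g(bX) = \tfrac1b\,g(X)$ ($g(\cdot)$ is homogeneous of degree -1)

c) $\{X : \|g(X)\|^2 > c\}$ is convex for each c > 0.

Then X + ag(X) dominates X for $0 < a < 2/E(1/\sigma^2)$.

Proof: By (6.2)

$E\|X + ag(X) - \theta\|^2 = pE\sigma^2 + E^{\sigma}E^{X|\sigma}\left[a^2\|g(X)\|^2 + 2a\sigma^2\,\nabla\cdot g(X)\right]$

$\le pE\sigma^2 + E^{\sigma}E^{X|\sigma}\left[\|g(X)\|^2(a^2 - 2a\sigma^2)\right]$

$= pE\sigma^2 + E^{\sigma}E^{X|\sigma}\left[\left\|g\left(\tfrac{X}{\sigma}\right)\right\|^2\left(\dfrac{a^2}{\sigma^2} - 2a\right)\right]$

$\le pE\sigma^2.$

The last inequality follows since $E\|g(\tfrac{X}{\sigma})\|^2$ is increasing in $\sigma$ (by Anderson's Theorem),
$\left[\dfrac{a^2}{\sigma^2} - 2a\right]$ is decreasing in $\sigma^2$, and $E\left[\dfrac{a^2}{\sigma^2} - 2a\right] \le 0$ if $0 < a < 2/E(1/\sigma^2)$.

Hence versions of the estimators of Examples 5.1, 5.2 and 5.3 extend to the scale
mixture of normal families.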
7. Other Spherically Symmetric Distributions



Extensions of James-Stein type results to distributions other than scale mixtures of
normal distributions are due to Berger (1975), Brandwein and Strawderman (1978),
Brandwein (1979), and Bock (1985).



We present a new proof of Brandwein's result which gives a shrinkage factor for $p \ge 4$
which holds uniformly for all spherically symmetric distributions such that $EX'X < \infty$ and
$E(X'X)^{-1}$ is fixed. The factor given is the best possible and is attained for uniform
distributions concentrated on a spherical "shell" of radius R.

The spirit of the proof is to first obtain the result for spherical shells and to extend
it by use of a technical lemma to mixtures of such distributions. Since the class of
spherically symmetric distributions is precisely those obtained by scale mixtures of
spherical shells, the desired result follows.

Suppose X has a p-variate spherically symmetric density of the form $f(\|X-\theta\|^2)$.
Then $X = \theta + U$ where U has density $f(\|U\|^2)$.



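We will need the following facts:

F1: $V = \dfrac{(\theta'U)^2}{\|\theta\|^2\|U\|^2} = (\cos(\theta,U))^2$
has a Beta$\left(\tfrac12, \tfrac{p-1}{2}\right)$ distribution independent of $\|U\|$ (see Dempster (1969) p. 272).

F2: If p(v) is the density of V then
$p_\gamma(v) = c(\gamma)\,p(v)\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^{-1}$ and $p^*_\gamma(v) = c^*(\gamma)\,p(v)\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^{-2}$
have monotone likelihood ratio decreasing for $0 < \gamma < 1$ and increasing for $1 < \gamma < \infty$
(Proof: easy calculation).

We will prove

THEOREM 7.1: If X has density $f(\|X-\theta\|^2)$ and $E\|X\|^2$ and $E\|X\|^{-2}$ are finite then for $p \ge 4$,
$\left(1 - \dfrac{a}{\|X\|^2}\right)X$ dominates X for $0 < a < 2\left(\dfrac{p-2}{p}\right)\left[E_0\|X\|^{-2}\right]^{-1}$ for loss (4.1).

Proof:

(7.1)    $R\left(\theta,\left(1 - \dfrac{a}{X'X}\right)X\right) = E\left[\left\|\left(1 - \dfrac{a}{X'X}\right)X - \theta\right\|^2\right] = E\|X-\theta\|^2 + a^2E\dfrac{1}{X'X} - 2aE\dfrac{(X-\theta)'X}{X'X}.$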



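Let $X = U + \theta$, $U'U = R^2$ and use the fact that the distributions of U and $-U$ coincide to
get

(7.2)    $R\left(\theta,\left(1 - \dfrac{a}{X'X}\right)X\right) = ER^2 + E\left[\dfrac{a^2(R^2 + \theta'\theta) - 2a\left[R^2(R^2 + \theta'\theta) - 2(\theta'U)^2\right]}{(R^2 + \theta'\theta)^2 - 4(\theta'U)^2}\right].$

Since $R(\theta,X) = E\|X - \theta\|^2 = ER^2$ we have, letting $V = \dfrac{(\theta'U)^2}{\|\theta\|^2\|U\|^2}$ $(0 \le V \le 1)$,

(7.3)    $R\left(\theta,\left(1 - \dfrac{a}{X'X}\right)X\right) - R(\theta,X)
\le E\,E\left[\dfrac{a^2 - 2aR^2(1 - 2V)}{(R^2 + \theta'\theta) - 4\theta'\theta R^2V[\theta'\theta + R^2]^{-1}}\,\Big|\,R\right]$

$= E\,E\left[\dfrac{\dfrac{a^2}{R^2} - 2a(1-2V)}{1 + \dfrac{\theta'\theta}{R^2} - 4\dfrac{\theta'\theta}{R^2}V\left[1 + \dfrac{\theta'\theta}{R^2}\right]^{-1}}\,\Big|\,R\right].$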
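Now using fact F2 that if p(v) is the density of V then

$p_\gamma(v) = c(\gamma)\,p(v)\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^{-1}$, with $\gamma = \dfrac{\theta'\theta}{R^2}$,

has monotone decreasing likelihood ratio if $\dfrac{\theta'\theta}{R^2} < 1$ and monotone increasing likelihood ratio if
$\dfrac{\theta'\theta}{R^2} > 1$, we conclude that

(7.4)    $\dfrac{E\left[V\left(1 + \gamma - 4\gamma V[1+\gamma]^{-1}\right)^{-1}\,\big|\,R\right]}{E\left[\left(1 + \gamma - 4\gamma V[1+\gamma]^{-1}\right)^{-1}\,\big|\,R\right]}
= \int_0^1 v\,p_\gamma(v)\,dv \le \max\left\{\int_0^1 v\,p_0(v)\,dv,\ \lim_{\gamma\to\infty}\int_0^1 v\,p_\gamma(v)\,dv\right\} = \dfrac1p.$

The last equality follows since $p_\gamma(v) = p_{1/\gamma}(v)$, so that

$\lim_{\gamma\to\infty}\int_0^1 v\,p_\gamma(v)\,dv = \lim_{\gamma\to\infty}\int_0^1 v\,p_{1/\gamma}(v)\,dv = \int_0^1 v\,p_0(v)\,dv = \dfrac1p.$

Here we also use F1 that $p_0(v)$ is a Beta$\left(\tfrac12, \tfrac{p-1}{2}\right)$ density.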



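Hence combining (7.3) and (7.4)

(7.5)    $R\left(\theta,\left(1 - \dfrac{a}{X'X}\right)X\right) - R(\theta,X)
\le E\left\{\left[\dfrac{a^2}{R^2} - 2a\left(\dfrac{p-2}{p}\right)\right]E\left[\left(1 + \gamma - 4\gamma V[1+\gamma]^{-1}\right)^{-1}\,\Big|\,R\right]\right\}$, with $\gamma = \dfrac{\theta'\theta}{R^2}$.

We will show below that

$E\left[\left(1 + \dfrac{\theta'\theta}{R^2} - 4\dfrac{\theta'\theta}{R^2}V\left[1 + \dfrac{\theta'\theta}{R^2}\right]^{-1}\right)^{-1}\right]$

is decreasing in $\dfrac{\theta'\theta}{R^2}$ and hence increasing in $R^2$ for fixed $\|\theta\|^2$. This together with the fact
that $\dfrac{a^2}{R^2} - 2a\left(\dfrac{p-2}{p}\right)$ is decreasing in $R^2$ implies that

$R\left(\theta,\left(1 - \dfrac{a}{X'X}\right)X\right) - R(\theta,X) \le E\left[\dfrac{a^2}{R^2} - 2a\left(\dfrac{p-2}{p}\right)\right]E\left[\left(1 + \gamma - 4\gamma V[1+\gamma]^{-1}\right)^{-1}\right] \le 0$

provided $a^2E\dfrac{1}{R^2} - 2a\left(\dfrac{p-2}{p}\right) \le 0$. The Theorem follows since $E\dfrac{1}{R^2} = E_0\dfrac{1}{\|X\|^2}$.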

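It remains to show that $E\left[\left(1 + \gamma - 4\gamma V[1+\gamma]^{-1}\right)^{-1}\right]$ is decreasing in $\gamma = \dfrac{\theta'\theta}{R^2}$ to
complete the proof. Note first

(7.6)    $\dfrac{d}{d\gamma}\int_0^1 \dfrac{p(v)\,dv}{1 + \gamma - 4\gamma v[1+\gamma]^{-1}} = \dfrac{1}{(1+\gamma)^2}\int_0^1 \dfrac{4v - (1+\gamma)^2}{\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^2}\,p(v)\,dv.$

Clearly if $\gamma > 1$ this derivative is negative. To prove (7.6) is negative in the range
$0 < \gamma < 1$ use F2 and show

$c^*(\gamma)\int_0^1 \dfrac{4v - (\gamma+1)^2}{\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^2}\,p(v)\,dv \le c^*(\gamma)\int_0^1 \dfrac{4v - 1}{\left[1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right]^2}\,p(v)\,dv$

$= \int_0^1 (4v-1)\,p^*_\gamma(v)\,dv \le \int_0^1 (4v-1)\,p(v)\,dv = \left(\dfrac4p - 1\right) \le 0$ if $p \ge 4$,

where $c^*(\gamma) = \left[\int_0^1 \dfrac{p(v)}{\left(1 + \gamma - 4\gamma v[1+\gamma]^{-1}\right)^2}\,dv\right]^{-1}$.

This completes the proof.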

We have noted that the factor $2\left(\dfrac{p-2}{p}\right)/E_0\|X\|^{-2}$ is the best possible constant which
holds uniformly for all spherically symmetric distributions with $E_0\|X\|^{-2}$ fixed. For
specific distributions, obviously, better results are possible (see Bock (1985)). It is
remarkable however, that the best possible constant for any distribution can be no larger
than $2/E_0\|X\|^{-2}$ as can be easily seen by calculating the risk at 0. Hence the factor given
in Theorem 7.1 which applies uniformly is surprisingly close to the best that can be
attained for any given distribution.
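The gap between the uniform constant of Theorem 7.1 and the largest constant possible for any single distribution is just the factor (p - 2)/p. The short sketch below makes the comparison concrete; the function names are hypothetical, and the only input is the value of $E_0\|X\|^{-2}$, which equals 1/(p - 2) in the standard normal case and $1/R^2$ for the uniform distribution on a shell of radius R.

```python
def uniform_shrinkage_bound(p, e0_inv_norm2):
    """Theorem 7.1 limit: 2 (p - 2) / (p * E_0 ||X||^-2)."""
    return 2.0 * (p - 2) / (p * e0_inv_norm2)


def largest_possible_bound(e0_inv_norm2):
    """Limit 2 / E_0 ||X||^-2 obtained by computing the risk at theta = 0."""
    return 2.0 / e0_inv_norm2
```

For p = 10 in the normal case, $E_0\|X\|^{-2} = 1/8$, so the uniform bound is 12.8 against the normal-theory limit 2(p - 2) = 16; the ratio is (p - 2)/p = 0.8 and tends to 1 as p grows.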

We note for completeness that the results of Theorem 6.1 for mixtures of normals
and Theorem 7.1 for spherically symmetric distributions can be extended to prove
minimaxity of estimators of the form $\left(1 - \dfrac{a\,r(X'X)}{X'X}\right)X$. The conditions on a are as in their
respective theorems and the conditions on $r(\cdot)$ are: a) $0 \le r(\cdot) \le 1$; b) r(Y) is monotone
nondecreasing; and c) r(Y)/Y is monotone nonincreasing.

8. Multiple Observations

So far we have concentrated our attention on improving the estimator X based on a
single observation from a population with density $f(\|X-\theta\|^2)$. Suppose we have a sample
$X_1,\dots,X_n$ from such a population and the problem is to estimate the p-dimensional vector $\theta$
with loss (4.1).

In this case the natural estimator is Pitman's estimator, one version of which is
given by $\delta(X,Y) = X_1 - E_0[X_1|Y]$, where $Y = (X_1 - X_n, X_2 - X_n, \dots, X_{n-1} - X_n)$. This
estimator, which is minimax and best among equivariant estimators, is inadmissible if $p \ge 3$.
For n > 2, $\bar{X}$, the vector of sample averages, is Pitman's estimator if and only if the
population is normal. For non-normal populations, Pitman's estimator is typically
difficult to calculate and its distribution tends to be analytically intractable. A variety of
other estimators such as $\bar{X}$, or the M.L.E. might be used instead, all of which are in the
class of estimators equivariant under the orthogonal and location groups. That is, for any
such estimator

$\delta(QX_1 - c, QX_2 - c, \dots, QX_n - c) = Q\,\delta(X_1,\dots,X_n) - c.$

Since the sampling distribution of any such estimator (when sampling from a spherically
symmetric distribution) is itself spherically symmetric, Theorem 7.1 applies and we may
conclude that $\delta(X_1,\dots,X_n)$ is dominated by $\left(1 - \dfrac{a}{\|\delta(X_1,\dots,X_n)\|^2}\right)\delta(X_1,\dots,X_n)$ for

$0 < a < [2(p-2)]/[pE_0\|\delta(X_1,\dots,X_n)\|^{-2}].$

9. Other Loss Functions 

There are two major lines of development relating to generalizations concerning the 
loss function (4.1). The first is to consider general quadratic loss given by 

(9.1)    $L(\theta,\delta) = (\delta-\theta)'D(\delta-\theta)$

where D is a given pxp positive definite matrix. The second relates to non-quadratic loss. 
The earliest results for loss (9.1) in the normal case are due to Bhattacharya (1966).

An early representative result is the following due to Bock (1975). Let

(9.2)    $\delta(X) = \left[I - a\,D^{-\frac12}CD^{\frac12}\,\|X\|^{-2}\right]X$

where C is a known positive definite matrix. Let $\lambda_L$ be the largest eigenvalue of $D^{\frac12}CD^{\frac12}$
and $\gamma_L$ be the largest eigenvalue of $D^{\frac12}C^2D^{\frac12}$.

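THEOREM 9.1. Let $X \sim N_p(\theta,I)$; then the estimator (9.2) dominates X under loss (9.1)
provided $0 < a < 2[\mathrm{tr}\,CD - 2\lambda_L]/\gamma_L$.

Proof:

$R(\theta,\delta) - R(\theta,X) = E_\theta\left[a^2\,\dfrac{X'D^{\frac12}C^2D^{\frac12}X}{(X'X)^2} - 2a\,\dfrac{(X-\theta)'D^{\frac12}CD^{\frac12}X}{X'X}\right]$

$= E_\theta\left[a^2\,\dfrac{X'D^{\frac12}C^2D^{\frac12}X}{(X'X)^2}\right] - 2aE_\theta\left[\dfrac{X'X\,\mathrm{tr}\,D^{\frac12}CD^{\frac12} - 2X'D^{\frac12}CD^{\frac12}X}{(X'X)^2}\right]$    (by Lemma 4.1 as in Example 5.1)

$\le E_\theta\left[\dfrac{1}{X'X}\left[a^2\gamma_L - 2a(\mathrm{tr}\,CD - 2\lambda_L)\right]\right]$

$\le 0.$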



There are a number of results of this type for estimators of the form 



(9.3) 



[I - which may be proved in much the same way. 



It is worth noting that the problem of estimating the mean vector $\theta$ when
$X \sim N(\theta,\Sigma)$ with $\Sigma$ known, loss given by (9.1), and estimators of the form (9.3) is essentially
reducible to the case $\Sigma = I$. In this more general setting a variety of justifications for
different choices of B and C in (9.3) have been given from the robust Bayesian perspective 
(Berger (1982)) from the ridge regression perspective (Thisted (1976), Strawderman (1978), 
Draper and Van Nostrand (1979), Casella (1980)), and from an empirical Bayesian 
perspective (Efron and Morris (1973, 1975), Morris (1983)) among others. 

A variety of results covering non-normal situations have been found by Berger
(1975) and Chou and Strawderman (1986) in the scale mixture of normals case, by
Brandwein and Strawderman (1978) in the spherically symmetric unimodal case, and by
Brandwein (1979) in the spherically symmetric case.

In particular, Brandwein's (1979) result for X from a spherically symmetric
distribution replaces the upper bound for a in Theorem 9.1 by

$2(\operatorname{tr} CD - 2\lambda_L)/[p\,\gamma_L E_0(X'X)^{-1}]$.

In the normal case $E_0(X'X)^{-1} = 1/(p-2)$, so this is (p-2)/p times the upper bound in
Theorem 9.1 and, again, the degree of shrinkage allowed is relatively unaffected by the
assumption of normality.

Results concerning extensions to non-quadratic loss are relatively few. Berger
(1978) has results in the normal case for polynomial loss. Brandwein and Strawderman
(1980) and Bock (1985) have results for losses of the form

(9.4) $L(\theta,\delta) = f(\|\delta-\theta\|^2)$

where f(·) is an increasing concave function. Here is a version of Brandwein and
Strawderman's result.

THEOREM 9.2. Let X have a spherically symmetric distribution with p > 4. Then
$\delta(X) = (1 - a/X'X)X$ dominates X for the loss (9.4) provided $0 < E_0 f'(R^2) < \infty$ and
$0 < a < [2(p-2)]/[p E_H R^{-2}]$, where G(·) is the cdf of $R = \|X-\theta\|$ and
$H(t) = \int_0^t f'(s^2)\,dG(s) / \int_0^\infty f'(s^2)\,dG(s)$. $E_0$ and $E_H$ denote expected values
under the cdf G(·) and H(·) respectively.

Proof: We use the fact that f(·) concave implies
$f(\|X-\theta\|^2 + u) \le f(\|X-\theta\|^2) + u f'(\|X-\theta\|^2)$ to obtain

(9.5) $R(\theta,\delta) - R(\theta,X)$

$= E[f(\|X-\theta\|^2 + a^2/X'X - 2a(X-\theta)'X/X'X) - f(\|X-\theta\|^2)]$

$\le E\{f'(\|X-\theta\|^2)[a^2/X'X - 2a(X-\theta)'X/X'X]\}$

$= E\,E[(a^2/X'X - 2a(X-\theta)'X/X'X) f'(R^2) \mid \|X-\theta\| = R]$.

But it follows from Theorem 7.1 that the last expression in (9.5) is negative provided
$0 < a < 2(p-2)/[p E_H R^{-2}]$. Hence the theorem follows.

The proof for estimators of the form $(1 - a\,r(X'X)/X'X)X$, where r(·) satisfies the
conditions of the remarks following the proof of Theorem 7.1, is essentially identical to the
above proof.

As an application of this result to the spherical normal case, let $X \sim N_p(\theta, I)$,
$L(\theta,\delta) = \|\delta-\theta\|^q$, 0 < q < 2. Hence $f(u) = u^{q/2}$, $f'(u) = (q/2)u^{q/2-1}$, and since
$R^2 = \|X-\theta\|^2 \sim \chi^2_p$, $E_H R^{-2} = E_0 R^{q-4}/E_0 R^{q-2} = (p+q-4)^{-1}$. Hence we are assured
that the estimator (1 - a/X'X)X dominates X for 0 < a < 2(p-2)(1 - (4-q)/p), for p > 4.
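
The moment ratio used in this application can be verified numerically from the standard chi-square moment formula $E(R^2)^k = 2^k\Gamma(p/2+k)/\Gamma(p/2)$. The sketch below is only a check of the arithmetic; the function names are hypothetical and it assumes p + q > 4 so that all moments are finite.

```python
import math


def chi_square_moment(p, k):
    """E[(R^2)^k] for R^2 ~ chi-square with p degrees of freedom (needs p/2 + k > 0)."""
    return math.exp(k * math.log(2.0) + math.lgamma(p / 2.0 + k) - math.lgamma(p / 2.0))


def e_h_inverse_r2(p, q):
    """E_H[R^-2] = E_0[R^(q-4)] / E_0[R^(q-2)] for f(u) = u^(q/2)."""
    return chi_square_moment(p, q / 2.0 - 2.0) / chi_square_moment(p, q / 2.0 - 1.0)


def shrinkage_bound(p, q):
    """Upper bound 2(p-2) / (p * E_H[R^-2]) on a from Theorem 9.2."""
    return 2.0 * (p - 2) / (p * e_h_inverse_r2(p, q))


if __name__ == "__main__":
    for p, q in [(5, 1.0), (6, 1.0), (10, 1.5)]:
        closed_form = 2.0 * (p - 2) * (1.0 - (4.0 - q) / p)
        print(p, q, shrinkage_bound(p, q), closed_form)
```

The two printed bounds should agree up to floating-point rounding, confirming that the general bound of Theorem 9.2 reduces to 2(p-2)(1 - (4-q)/p) in the normal case.
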

The most important results for non-quadratic loss are those for confidence set
estimation. Hwang and Casella (1982) showed that the usual spherical confidence set
centered at X may be dominated by one centered at an appropriate (positive-part)
James-Stein estimator. Hwang and Chen (1986) extend domination results for confidence
sets centered at positive-part James-Stein estimators to non-normal settings.
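
To illustrate the recentering idea, the following Monte Carlo sketch compares the coverage of the usual sphere centered at X with that of a sphere of the same radius centered at a positive-part James-Stein estimator. It is an illustration under stated assumptions, not a reproduction of the cited results: the shrinkage constant a = p - 2 is an illustrative choice that has not been checked against the conditions of Hwang and Casella (1982), the radius uses a tabulated chi-square quantile, and all names are hypothetical.

```python
import math
import random

# Tabulated 0.90 quantile of the chi-square distribution with 5 degrees of
# freedom (value taken from standard tables; treated as an assumption here).
CHI2_5_090 = 9.236


def positive_part_js(x, a):
    """Positive-part James-Stein estimator max(0, 1 - a/X'X) X."""
    norm_sq = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - a / norm_sq)
    return [shrink * v for v in x]


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def coverage(theta, radius, a, n_rep=20000, seed=0):
    """Monte Carlo coverage of spheres centered at X and at the positive-part estimator."""
    rng = random.Random(seed)
    hits_x = hits_js = 0
    for _ in range(n_rep):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        hits_x += distance(x, theta) <= radius
        hits_js += distance(positive_part_js(x, a), theta) <= radius
    return hits_x / n_rep, hits_js / n_rep


if __name__ == "__main__":
    radius = math.sqrt(CHI2_5_090)
    for scale in (0.0, 1.0, 3.0):
        print(scale, coverage([scale] * 5, radius, a=3.0))
```

At $\theta = 0$ the recentered set covers whenever the usual set does, because the positive-part estimator is never farther from the origin than X. Elsewhere the comparison is an empirical matter in this sketch.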

10. Comments 

1. Unknown Scale. We have assumed throughout that the scale is known. For the
$N_p(\theta,\sigma^2 I)$ distribution, if an estimator of $\sigma^2$ is available which is distributed as a
multiple of a chi-square distribution independently of X, this case causes no difficulty. The
original James-Stein paper treats this problem, as do several others. In the non-normal case
much less is known. One can make some progress if an independent estimate of the scale is
available, at least in the mixture of normals case (see Bravo and MacGibbon (1987)), but such
an assumption seems unwarranted generally.
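
For the normal case with unknown scale, a minimal simulation sketch is given below. It assumes an independent statistic S with $S/\sigma^2 \sim \chi^2_m$ and uses the estimator $(1 - cS/X'X)X$ with the classical constant c = (p-2)/(m+2); the function names, dimensions, and parameter values are illustrative assumptions.

```python
import random


def js_unknown_scale(x, s, m):
    """(1 - c S / X'X) X with c = (p-2)/(m+2), S independent of X, S/sigma^2 ~ chi-square(m)."""
    p = len(x)
    c = (p - 2) / (m + 2)
    norm_sq = sum(v * v for v in x)
    return [(1.0 - c * s / norm_sq) * v for v in x]


def simulate_risks(theta, sigma, m, n_rep=20000, seed=0):
    """Monte Carlo risks (sum of squared errors) of X and of the shrinkage estimator."""
    rng = random.Random(seed)
    loss_x = loss_js = 0.0
    for _ in range(n_rep):
        x = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        # gammavariate(shape, scale): shape m/2 and scale 2 give a chi-square(m) draw.
        s = sigma ** 2 * rng.gammavariate(m / 2.0, 2.0)
        est = js_unknown_scale(x, s, m)
        loss_x += sum((v - t) ** 2 for v, t in zip(x, theta))
        loss_js += sum((e - t) ** 2 for e, t in zip(est, theta))
    return loss_x / n_rep, loss_js / n_rep


if __name__ == "__main__":
    for scale in (0.0, 2.0, 5.0):
        print(scale, simulate_risks([scale] * 6, sigma=2.0, m=10))
```

The estimator never uses the true $\sigma$, which enters only through the simulated data. The gain over X is largest near the origin and shrinks as $\theta$ moves away from it.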

2. Non-Spherically Symmetric Distributions. The discussion of section 9 can be extended
to handle distributions of the form $f((X-\theta)'\Sigma^{-1}(X-\theta))$ where $\Sigma$ is a known positive-definite
matrix by working with the random vector $\Sigma^{-1/2}X$ which has a spherically symmetric
distribution. In cases where the whole problem is not spherically symmetric a tension
between being Bayes (doing well on the average) and being minimax (never doing worse
than the best invariant estimator) often develops. It typically happens that minimax
estimators will shrink coordinates with larger variances relatively less than will Bayes
estimators. The phenomenon is complicated by the fact that for quadratic loss, the
minimax estimator will depend on the choice of D in (9.1) while the Bayes estimator will
not. See Berger (1985) and references therein for more details. The current
recommendations for choice of shrinkage procedures in such situations seem to favor a
Bayesian or empirical Bayesian basis as opposed to a purely minimax one, even among
more classically oriented decision theorists. This seems to be at least partly on the grounds
that minimaxity may be too strict a requirement here, and that relaxation to something
like $\epsilon$-minimaxity might preserve the large gains possible (near the origin, say) at a slight
cost for "large" values of $\theta$ in certain directions.

3. Independent Coordinates. It can be argued that a much more natural class of problems
than the ones we have been considering are those non-normal location problems where the
coordinates are independent. Since sphericity and independence together imply normality, we
have, unfortunately, described no results for this non-normal case. Shinozaki (1984) and Miceli
and Strawderman (1986, 1988) have some results for independent non-normal observations,
but the results are not nearly as extensive as for the spherical case.

4. Applications. We have said little about applications of James-Stein estimation. Efron
and Morris in a series of papers (1971, 1972, 1975, 1976, 1977 and others) fostered the
application of shrinkage estimation and addressed a number of practical considerations
including the unequal variance case, shrinkage in groups, and limited translation
estimators. Most of the published applications have had an empirical Bayes orientation.
For some examples the reader is referred to Efron and Morris (1973, 1975), Casella (1985),
Green and Strawderman (1985, 1986) and Braun et al. (1983).